1. Overview

Here, demonstrate …

  1. How to download some data from NCBI
  2. What the raw and parsed formats look like and how to format NCBI annotations for GENESPACE
  3. Run GENESPACE using default parameters
  4. The output formats
  5. How to query the results

To do this, we use five genomes from NCBI that span roughly 300M years of vertebrate evolution.

2. Getting into the right environment to run GENESPACE

The best way to run GENESPACE is to open a terminal and enter a conda environment (or similar) where orthofinder is in the path. OrthoFinder can be installed in many ways (see README. You also need to ensure that MCScanX is properly installed. Once you have those completed, from the terminal, enter: open -na rstudio, which will open an Rstudio interface (if installed). Otherwise, just enter an interactive R session by executing R.

Once within R, you can install GENESPACE (if you haven’t done so already).

devtools::install_github("jtlovell/GENESPACE", upgrade = F)

If you have already installed a previous version of GENESPACE you may need to run:

detach("package:GENESPACE", unload = TRUE)
devtools::install_github("jtlovell/GENESPACE", upgrade = F)

Then import GENESPACE into the R environment:

library(GENESPACE)
## GENESPACE v1.1.10: synteny and orthology constrained comparative
##         genomics

3. Get your the data together

Set the paths for your run. Change these paths to those valid on your system

genomeRepo <- "~/path/to/store/rawGenomes"
wd <- "~/path/to/genespace/workingDirectory"
path2mcscanx <- "~/path/to/MCScanX/"

Download the data of interest. Here we are going to use human, mouse, platypus, chicken, and sand lizard. To do this programatically, get the URLs to the genome annotations, but leave off the ‘genomic.gff.gz’ string at the end, then make the full paths by appending ‘translated_cds.faa.gz’ and ‘genomic.gff.gz’ to the ends of the paths. Finally make the directories to write to and download the files.

urls <- c(
  human ="000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_",
  mouse = "000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_",
  platypus = "004/115/215/GCF_004115215.2_mOrnAna1.pri.v4/GCF_004115215.2_mOrnAna1.pri.v4_",
  chicken = "016/699/485/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_",
  sandLizard = "009/819/535/GCF_009819535.1_rLacAgi1.pri/GCF_009819535.1_rLacAgi1.pri_")

genomes2run <- names(urls)
urls <- file.path("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF", urls)
translatedCDS <- sprintf("%stranslated_cds.faa.gz", urls)
geneGff <- sprintf("%sgenomic.gff.gz", urls)

names(translatedCDS) <- genomes2run
names(geneGff) <- genomes2run
writeDirs <- file.path(genomeRepo, genomes2run)
names(writeDirs) <- genomes2run
for(i in genomes2run){
  print(i)
  if(!dir.exists(writeDirs[i]))
    dir.create(writeDirs[i])
  download.file(
    url = geneGff[i], 
    destfile = file.path(writeDirs[i], basename(geneGff[i])))
  download.file(
    url = translatedCDS[i], 
    destfile = file.path(writeDirs[i], basename(translatedCDS[i])))
}

4. Parse the annotations

GENESPACE requires that fasta headers exactly match the fourth (ID) column in the bed file. Feel free to do this yourself, or use the GENESPACE function parse_annotations.

genomes2run <- c("human", "mouse", "platypus", "chicken", "sandLizard")
parsedPaths <- parse_annotations(
  rawGenomeRepo = genomeRepo,
  genomeDirs = genomes2run,
  genomeIDs = genomes2run,
  presets = "ncbi",
  genespaceWd = wd)
## human: n unique sequences = 20627, n matched to gff = 20627 (dropped 88 ./- chars)
## mouse: n unique sequences = 22852, n matched to gff = 22852 (dropped 254 ./- chars)
## platypus: n unique sequences = 18252, n matched to gff = 18252
## chicken: n unique sequences = 18093, n matched to gff = 18093 (dropped 1 ./- chars)
## sandLizard: n unique sequences = 20364, n matched to gff = 20364

See APPENDIX A: annotation formats for more information on how the annotations should be stored and prepared.

5 Initialize the GENESPACE run and set parameters

5.1 Calling init_genespace

The function init_genespace does most of the heavy lifting in terms of checking to make sure that the input data is OK. It also produces the correct directory structure and corresponding paths for the GENESPACE run. We have it as a separate function from the main pipeline run_genespace so that we can catch issues with input and give the user flexibility to alter input without re-running the full pipeline.

gpar <- init_genespace(
  wd = wd,
  path2mcscanx = path2mcscanx)
## Checking Working Directory ... PASS: `~/Desktop/gs_v1_runs/test4riptut/run`
## Checking user-defined parameters ...
##  Genome IDs & ploidy ... 
##      chicken   : 1
##      human     : 1
##      mouse     : 1
##      platypus  : 1
##      sandLizard: 1
##  Outgroup ... NONE
##  n. parallel processes ... 5
##  collinear block size ... 5
##  collinear block search radius ... 25
##  n gaps in collinear block ... 5
##  synteny buffer size... 100
##  only orthogroups hits as anchors ... TRUE
##  n secondary hits ... 0
## Checking annotation files (.bed and peptide .fa):
##  chicken   : 18093 / 18093 geneIDs exactly match (PASS)
##  human     : 20627 / 20627 geneIDs exactly match (PASS)
##  mouse     : 22852 / 22852 geneIDs exactly match (PASS)
##  platypus  : 18252 / 18252 geneIDs exactly match (PASS)
##  sandLizard: 20364 / 20364 geneIDs exactly match (PASS)
## Checking dependencies ...
##  Found valid path to OrthoFinder v2.54: `orthofinder`
##  Found valid path to DIAMOND2 v2.014: `diamond`
##  Found valid MCScanX_h executable: `/Users/lovell/Desktop/programs/MCScanX//MCScanX_h`

See APPENDIX B GENESPACE parameters for more information on how the run can be customized. However, the defaults should work for most circumstances.

5.2 GENESPACE directory structure

When called, init_genespace first checks that the annotations conform as above in the /bed and /peptide directories of the genespace working directory wd. It then builds out the proper directory structure for the GENESPACE run, which are initially empty. The directory is structed as follows:

/wd
└───bed            # see above - where the parsed bed annotations go
└───peptide        # see above - where the parsed peptide fastas go
└───orthofinder    # where the raw orthofinder results reside
└───results        # where the parsed orthofinder results and other output
└───syntenicHits   # parsed blast hits and syntenic hits stored here
└───dotplots       # all blast and syntenic block dotplots 
└───tmp            # special directory for temporary or internal files
└───pangenes       # intermediate and final pan-gene construction data
└───riparian       # riparian plots and their source data incl. phased blocks

6. Conduct the GENESPACE run

Using default settings, depending on your machine, this can take a few minutes up to an hour.

GENESPACE prints a ton of output. We don’t provide a mechanism to suppress this because it is crucial to make sure things are looking ok. There are some descriptions about the text printed to the console (see below output …)

out <- run_genespace(gpar, overwrite = T)
## 
## ############################
## 1. Running orthofinder (or parsing existing results)
##  Checking for existing orthofinder results ...
##  ... found existing run, not re-running orthofinder
##         re-ordering genomeIDs by the species tree: human, mouse,
##                 platypus, chicken, sandLizard
## 
## ############################
## 2. Combining and annotating bed files w/ OGs and tandem array info ...
##  ##############
##  Flagging chrs. w/ < 10 unique orthogroups
##  ...chicken   : 475 genes on 55 small chrs. 
##      ...human     :   7 genes on  5 small chrs. 
##      ...mouse     :   1 genes on  1 small chrs. 
##      ...platypus  : 197 genes on 50 small chrs. 
##      ...sandLizard:   2 genes on  2 small chrs. 
##  ##############
##  Flagging over-dispered OGs
##  ...chicken   : 167 genes in 6 OGs hit > 8 unique places 
##      ...human     : 148 genes in 5 OGs hit > 8 unique places 
##      ...mouse     : 156 genes in 5 OGs hit > 8 unique places 
##      ...platypus  : 345 genes in 6 OGs hit > 8 unique places 
##      ...sandLizard: 438 genes in 8 OGs hit > 8 unique places 
##  ##############
##  Annotation summaries (after exclusions):
##  ...chicken   : 17451 genes in 15561 OGs || 2076 genes in 389 arrays
##      ...human     : 20473 genes in 17730 OGs || 3045 genes in 835 arrays
##      ...mouse     : 22695 genes in 18048 OGs || 5093 genes in 956 arrays
##      ...platypus  : 17815 genes in 16296 OGs || 1671 genes in 494 arrays
##      ...sandLizard: 19924 genes in 17093 OGs || 2928 genes in 778 arrays
## 
## ############################
## 3. Combining and annotating the blast files with orthogroup info ...
##  # Chunk 1 / 3 (12:44:39) ... 
##  ...mouse v. mouse:           total hits = 271842, same og = 82158
##      ...sandLizard v. sandLizard: total hits = 228821, same og = 52316
##      ...mouse v. sandLizard:      total hits = 292554, same og = 18892
##      ...mouse v. human:           total hits = 290818, same og = 28831
##      ...mouse v. platypus:        total hits = 276135, same og = 21848
##  # Chunk 2 / 3 (12:44:42) ... 
##  ...sandLizard v. human:      total hits = 263514, same og = 18331
##      ...human v. human:           total hits = 220601, same og = 47634
##      ...sandLizard v. platypus:   total hits = 251588, same og = 17834
##      ...human v. platypus:        total hits = 240331, same og = 22339
##      ...platypus v. platypus:     total hits = 188424, same og = 38043
##  # Chunk 3 / 3 (12:44:44) ... 
##  ...sandLizard v. chicken:    total hits = 233566, same og = 16908
##      ...mouse v. chicken:         total hits = 249754, same og = 15350
##      ...chicken v. chicken:       total hits = 190677, same og = 56494
##      ...human v. chicken:         total hits = 224848, same og = 15143
##      ...platypus v. chicken:      total hits = 207314, same og = 14896
##  ##############
##  Generating dotplots for all hits ... Done!
## 
## ############################
## 4. Flagging synteny for each pair of genomes ...
##  # Chunk 1 / 3 (12:45:22) ... 
##  ...mouse v. human:           22261 hits (15782 anchors) in 357 blocks (250 SVs, 233 regions)
##      ...human v. platypus:        18500 hits (13668 anchors) in 593 blocks (447 SVs, 395 regions)
##      ...mouse v. platypus:        18295 hits (13622 anchors) in 604 blocks (410 SVs, 453 regions)
##      ...mouse v. sandLizard:      17586 hits (12649 anchors) in 746 blocks (595 SVs, 467 regions)
##      ...sandLizard v. human:      17543 hits (12572 anchors) in 684 blocks (581 SVs, 382 regions)
##  # Chunk 2 / 3 (12:45:48) ... 
##  ...sandLizard v. platypus:   17530 hits (12931 anchors) in 677 blocks (550 SVs, 413 regions)
##      ...sandLizard v. chicken:    16770 hits (12542 anchors) in 496 blocks (436 SVs, 254 regions)
##      ...mouse v. chicken:         15868 hits (11705 anchors) in 660 blocks (524 SVs, 436 regions)
##      ...human v. chicken:         15761 hits (11755 anchors) in 589 blocks (508 SVs, 360 regions)
##      ...platypus v. chicken:      15794 hits (11845 anchors) in 593 blocks (491 SVs, 364 regions)
##  # Chunk 3 / 3 (12:46:10) ... 
##  ...mouse v. mouse:           103561 hits (22798 anchors) in 23 blocks (0 SVs, 0 regions)
##      ...chicken v. chicken:       56904 hits (18084 anchors) in 96 blocks (0 SVs, 0 regions)
##      ...sandLizard v. sandLizard: 60533 hits (20364 anchors) in 23 blocks (0 SVs, 0 regions)
##      ...human v. human:           58374 hits (20571 anchors) in 30 blocks (0 SVs, 0 regions)
##      ...platypus v. platypus:     40713 hits (18240 anchors) in 82 blocks (0 SVs, 0 regions)
## 
## ############################
## 5. Building synteny-constrained orthogroups ...
##  Done!
## 
## ############################
## 6. Integrating syntenic positions across genomes ...
##  ##############
##  Generating syntenic dotplots ... Done!
##  ##############
##  Interpolating syntenic positions of genes ... 
##  human:  (0 / 1 / 2 / >2 syntenic positions)
##      chicken   :  1823 / 13763 /   178 /     0
##      human     :     0 / 20627 /     0 /     0
##      mouse     :   621 / 17880 /    57 /     0
##      platypus  :  1288 / 15094 /   249 /     7
##      sandLizard:  2368 / 15095 /   311 /     0
##  mouse:  (0 / 1 / 2 / >2 syntenic positions)
##      chicken   :  1845 / 13568 /   350 /     1
##      human     :   695 / 17402 /   166 /     0
##      mouse     :     0 / 22852 /     0 /     0
##      platypus  :  1455 / 14952 /   222 /     9
##      sandLizard:  2336 / 14899 /   531 /     8
##  platypus:  (0 / 1 / 2 / >2 syntenic positions)
##      chicken   :  1696 / 13699 /   357 /    12
##      human     :  1854 / 16060 /   344 /     5
##      mouse     :  2160 / 16169 /   229 /     0
##      platypus  :     0 / 18252 /     0 /     0
##      sandLizard:  2004 / 15339 /   425 /     6
##  chicken:  (0 / 1 / 2 / >2 syntenic positions)
##      chicken   :     0 / 18089 /     0 /     0
##      human     :  3477 / 14638 /   148 /     0
##      mouse     :  3674 / 14684 /   182 /    18
##      platypus  :  2628 / 13909 /   101 /     0
##      sandLizard:  2537 / 15104 /   133 /     0
##  sandLizard:  (0 / 1 / 2 / >2 syntenic positions)
##      chicken   :  1344 / 14377 /    43 /     0
##      human     :  2729 / 15204 /   325 /     5
##      mouse     :  2999 / 15269 /   290 /     0
##      platypus  :  1823 / 14519 /   296 /     0
##      sandLizard:     0 / 20364 /     0 /     0
##  Done!
## 
## ############################
## 7. Final block coordinate calculation and riparian plotting ...
##  ##############
##  Calculating syntenic blocks by reference chromosomes ... 
##      n regions (aggregated by 25 gene radius): 7658
##      n blocks (collinear sets of > 5 genes): 11188
##  ##############
##  Building ref.-phased blks and riparian plots for haploid genomes:
##      chicken   : 7593 phased blocks
##      human     : 8203 phased blocks
##      mouse     : 9229 phased blocks
##      platypus  : 8991 phased blocks
##      sandLizard: 8127 phased blocks
##  Done!
## 
## ############################
## 8. Constructing syntenic pan-gene sets ...
##  human     : n pos. = 27157, synOgs = 91720, array mem. = 8632, NS orthos 21192
##  mouse     : n pos. = 27235, synOgs = 91127, array mem. = 9444, NS orthos 21332
##  platypus  : n pos. = 27290, synOgs = 93067, array mem. = 7680, NS orthos 21341
##  chicken   : n pos. = 27185, synOgs = 93991, array mem. = 6510, NS orthos 21431
##  sandLizard: n pos. = 27245, synOgs = 93422, array mem. = 7368, NS orthos 21196
## 
## ############################
## GENESPACE run complete!  All results are stored in
## /Users/lovell/Desktop/gs_v1_runs/test4riptut/run in the following
## subdirectories:
##  syntenic block dotplots: /dotplots (...synHits.pdf)
##  annotated blast files  : /syntenicHits
##  annotated/combined bed : /results/combBed.txt
##  syntenic block coords. : /results/blkCoords.txt
##  syn. blk. by ref genome: /riparian/refPhasedBlkCoords.txt
##  pan-genome annotations : /pangenes (...pangenes.txt.gz)
##  riparian plots         : /riparian
##  genespace param. list  : /results/gsParams.rda
## ############################
## **NOTE** the genespace parameter object is returned or can be loaded
##         into R via
##         `load('/Users/lovell/Desktop/gs_v1_runs/test4riptut/run/results/gsParams.rda',
##         verbose = TRUE)`. Then you can customize your riparian plots by
##         calling `plot_riparian(gsParam = gsParam, ...)`. The source
##         data and ggplot2 objects are also stored in the /riparian
##         directory and can also be accessed by `load(...)`.
## **NOTE** To query genespace results by position or gene, use
##         `query_genespace(...)`. See specifications in ?query_genespace
##         for details.
## ############################

See APPENDIX C: GENESPACE synteny methods and APPENDIX D: run_genespace pipeline for more details about what GENESPACE does under the hood. Information in APPENDIX C explaining the verbose output printed to the console too.

7. Visualizing global synteny

We can visualize synteny in two ways:

7.1 riparian plots

If you have more than 2 genomes, plot_riparian give you access to an information-dense synteny map. See the riparian guide tutorial for more information.

ripd <- plot_riparian(
  gsParam = out,
  refGenome = "human", 
  useRegions = FALSE)

7.2 dotplots

For just a pair of genomes, its often useful to visualize global patterns of synteny through pairwise x-y dotplots. We can do this with GENESPACE using ggdotplot. This produces three dotplots: (1) all hits above a score threshold, (2) all hits where the query and target are in the same orthogroup, (3) syntenic anchor hits colored by syntenic block ID. The first two are heatmaps to reduce file size. The brighter the color, the more hits underlie that point. The title explains the binning and thresholding for the heatmaps.

hits <- read_allBlast(
  filepath = file.path(out$paths$syntenicHits, 
                       "mouse_vs_human.allBlast.txt.gz"))
ggdotplot(hits = hits, type = "all", verbose = FALSE)

7. Querying the output

Currently, GENESPACE has three querying utilities: block coordinates, pan-genes and syntenic hits. In all cases, the input is the GENESPACE parameter object and a bed file with the coordinates of the region of interest.

7.1 Specifying a set of regions of interest

Here, we’ll do the query on a single bed file-like data.frame object containing three regions of interest:

roi <- data.frame(
  genome = c("human", "human", "chicken"),
  chr = c("6", "X", "Z"),
  start = c(3.8e6, 0, 0),
  end = c(4.5e6, Inf, 1e6))

7.2 Pulling hits against a reference for a specific region

For each line in the regions of interest bed parameters, we extract all hits that map to it. We can either pull ALL hits or just those in synteny. Since synteny is so strong in this set, its reasonable to look only at syntenic hits

qreturn <- query_hits(gsParam = out, bed = roi, synOnly = TRUE)

7.3 Making dotplots from the hits

We offer gghits as a method to make a dotplot from hits stored in memory as a data.table. We can do this here for one of the regions of interest: just human and mouse X chr:

xvx <- subset(qreturn[["human, X: 0-Inf"]], genome2 == "mouse")
gghits(xvx, useOrder = F)

And just human v. the chromosome that the human X is syntenic to in chicken:

xvx <- subset(qreturn[["human, X: 0-Inf"]], genome2 == "chicken")
gghits(xvx, useOrder = T)

7.4 Querying pan-genes

A set of pan-genes are in the same orthogroup and can be placed into a synteny position against a reference genome. We can extract these as above using query_pangenes:

test <- query_pangenes(
  gsParam = out, bed = roi)

Here is what a few lines of each region looks like (‘…’ indicate more genes are there, but for similicity, we are just showing the first):

## [1] "human, 6: 3800000-4500000"
##   pgID interpChr interpOrd        human        mouse     platypus chicken   sandLizard
## 1 8418         6    7154.0       FAM50B       Fam50b      FAM50A*              FAM50A*
## 2 8419         6    7154.5                                                LOC117049681
## 3 8420         6    7154.5      ZNF219*      Zfp219* LOC100079925              ZNF219*
## 4 8421         6    7157.0 HLA-DQB2.... H2-Eb1, .... LOC10008....                     
## 5 8422         6    7158.0       PRPF4B       Prpf4b PRPF4B, ....  PRPF4B       PRPF4B
## [1] "human, X: 0-Inf"
##   pgID interpChr interpOrd        human    mouse     platypus chicken sandLizard
## 1    1         X         2 LOC12490.... Ppp2r3d*      PPP2R3B PPP2R3B    PPP2R3B
## 2    2         X         3       TCEAL6                                         
## 3    3         X         4         NKRF    Nkrf* LOC10007....   NKRF*      NKRF*
## 4    4         X         5      CCDC120 Ccdc120*     CCDC120*           CCDC120*
## 5    5         X         6       CT47B1                                         
## [1] "chicken, Z: 0-1e+06"
##   pgID interpChr interpOrd  human  mouse platypus      chicken sandLizard
## 1   34         Z  110.0000 MYO5C* Myo5c*   MYO5C* LOC121108491     MYO5C*
## 2   35         Z  111.0000   SKA1   Ska1     SKA1 SKA1, LO....       SKA1
## 3   36         Z  111.3333  MAPK4  Mapk4    MAPK4                   MAPK4
## 4   37         Z  111.6667    MRO    Mro      MRO                        
## 5   38         Z  112.0000    ME2    Me2      ME2 LOC10705....        ME2

7.5 riparian plots for regions of interest

See the riparian plotting guide for more details, but we can add a color column to the bed and only look at these regions, either with the full synteny map as a background:

roibed <- roi[,c("genome", "chr")]
roibed$color <- c("pink", "cyan", "green")
ripd <- plot_riparian(
  gsParam = out, 
  useRegions = FALSE, 
  highlightBed = roibed)

Or just looking at the focal regions of interest:

ripd <- plot_riparian(
  gsParam = out,  
  useRegions = FALSE,
  highlightBed = roibed, 
  backgroundColor = NULL)

7.6 Output

See APPENDIX E: output formats for more detail on the raw data output from GENESPACE.

Appendicies

APPENDIX A: annotation formats

A.1 Local raw data repository structure for parse_annotations

This reads raw ncbi-formatted files stored in separate folders for each genome genomeRepo. Here, this looks like this:

/genomeRepo
└───human
│   │   GCF_000001405.40_GRCh38.p14_translated_cds.faa.gz
│   │   GCF_000001405.40_GRCh38.p14_genomic.gff.gz
└───mouse
│   │   GCF_000001635.27_GRCm39_translated_cds.faa.gz
│   │   GCF_000001635.27_GRCm39_genomic.gff.gz
└───platypus
│   │   GCF_004115215.2_mOrnAna1.pri.v4_translated_cds.faa.gz
│   │   GCF_004115215.2_mOrnAna1.pri.v4_genomic.gff.gz
└───chicken
│   │   GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_translated_cds.faa.gz
│   │   GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.gff.gz
└───sandLizard
│   │   GCF_009819535.1_rLacAgi1.pri_translated_cds.faa.gz
│   │   GCF_009819535.1_rLacAgi1.pri_genomic.gff.gz

A.2 Parsed annotation storage and directory structure

It then takes these files, formats then re-names them into the GENESPACE working directory wd. The peptide fasta file stored in

/wd
└───bed
│   │   human.bed
│   │   mouse.bed
│   │   platypus.bed
│   │   chicken.bed
│   │   sandLizard.bed
└───peptide
│   │   human.fa
│   │   mouse.fa
│   │   platypus.fa
│   │   chicken.fa
│   │   sandLizard.fa

A.3 Parsed annotation format

The formatted files look like this for the human genome:

The bed file, extracted from the gene gff3 annotation, which has chromosome, start, end and id information for each gene (row. )

19   58345182   58353492   A1BG
10   50799408   50885627   A1CF
12   9067707    9116229    A2M
12   8822620    8887459    A2ML1
1    33306765   33321098   A3GALT2
22   42692120   42721301   A4GALT
3    138123712  138133302  A4GNT
12   53307459   53321610   AAAS
12   125065434  125143316  AACS
3    151814115  151828488  AADAC

The peptide fasta, simplified from the translatedCDS fasta:

>A1BG
MSMLVVFLLLWGVTWGPVTEAAIFYETQPSLWAESESLLKPLANVTLTCQAHLETPDFQLFKNGVAQEPVHLDSPAI...
>A1CF
MEAVCLGTCPEPEASMSTAIPGLKKGNNALQSIILQTLLEKENGQRKYGGPPPGWDAAPPERGCEIFIGKLPRDLFE...
>A2M
MGKNKLLHPSLVLLLLVLLPTDASVSGKPQYMVLVPSLLHTETTEKGCVLLSYLNETVTVSASLESVRGNRSLFTDL...
>A2ML1
MWAQLLLGMLALSPAIAEELPNYLVTLPARLNFPSVQKVCLDLSPGYSDVKFTVTLETKDKTQKLLEYSGLKKRHLH...
>A3GALT2
MALKEGLRAWKRIFWRQILLTLGLLGLFLYGLPKFRHLEALIPMGVCPSATMSQLRDNFTGALRPWARPEVLTCTPW...
>A4GALT
MSKPPDLLLRLLRGAPRQRVCTLFIIGFKFTFFVSIMIYWHVVGEPKEKGQLYNLPAEIPCPTLTPPTPPSHGPTPG...
>A4GNT
MRKELQLSLSVTLLLVCGFLYQFTLKSSCLFCLPSFKSHQGLEALLSHRRGIVFLETSERMEPPHLVSCSVESAAKI...
>AAAS
MCSLGLFPPPPPRGQVTLYEHNNELVTGSSYESPPPDFRGQWINLPVLQLTKDPLKTPGRLDHGTRTAFIHHREQVW...
>AACS
MSKEERPGREEILECQVMWEPDSKKNTQMDRFRAAVGAACGLALESYDDLYHWSVESYSDFWAEFWKFSGIVFSRVY...
>AADAC
MGRKSLYLLIVGILIAYYIYTPLPDNVEEPWRMMWINAHLKTIQNLATFVELLGLHHFMDSFKVVGSFDEVPPTSDE...

APPENDIX B GENESPACE parameters

For more details, see ?init_genespace. In general, the user should not have to adjust any of these, as GENESPACE synteny tends to conform well to the phylogenetic structure of the run. The parameters are left here so that users can sufficiently modify the GENESPACE run for edge cases (described below). Here short descriptions of all parameters that can be given to init_genespace:

B.1 Describing the run …

  • genomeIDs: the genome IDs that match files in /bed and /peptide. These must begin with a letter (a-z, A-Z) and should not include special characters. ‘.’ and ’_’ are OK.
  • ploidy: What is the ploidy of genome assemblies. This is usually half of the actual ploidy, that is an inbred diploid usually is represented by a haploid genome assembly. This is an important parameter that determines the number of top-scoring blast hits in orthogroups to be included in the synteny search.
  • ignoreTheseGenomes and outgroup: arguments are synonyms. Typically only used if there is a whole-genome duplication that predates the divergence of the genomes of interest, but the user wants to include homeologs (paralogs resulting from a WGD) among the focal species. This lets you specify which of the genomeIDs should be used in the orthofinder run but not in the synteny search.
  • orthofinderInBlk: This is the most important GENESPACE parameter, which specifies whether orthgroups should be re-called within syntenic blocks (TRUE) or if orthogroups should be inferred globally logical. Typically, if there are no WGDs in the history of the genomes, this results in very little imporvement in performance. However, it can significantly improve detection of homeologs if there is a WGD. Thus, default is FALSE unless any of the genomes have ploidy > 1.

B.2 How synteny is detected …

  • useHOGs sets whether phylogenetically hierarchical orthogroups (HOGs) or raw orthogroups are used. Since original -og orthogroups are explicitly deprecated under the current OrthoFinder release, this should be set to TRUE except in very rare cases.
  • nGaps: the MCScanX -m parameter, specifying the number of gaps in the directed acyclic graph. It is not recommended to mess with this parameters, but larger numbers may results in more paralogous blocks. Defaults to 5.
  • blkSize: the MCScanX -s and dbscan(minpts = ...) parameters specifying the minimum block size. Increasing this can cause much more stringent blocks. Defaults to 5.
  • blkRadius: the search radius with which to aggregate syntenic hits into ‘regions’. This is the best parameter to mess with if you want more detail (lower values) or more simple (higher values) synteny maps. Defaults to 25.
  • synBuff: the buffer around syntenic anchor hits (perfectly collinear hits in blocks) used for initial synteny searches. Larger values may let you extend to adjacent paralagous regions.
  • onlyOgAnchors: sets whether the query and target genes in syntenic anchors have to be in the same orthogroup. Defaults to TRUE and probably should stay that way except in rare cases.
  • onlySameChrs logical - should synteny be only considered between chromosomes with the same names? Not recommended except with recently diverged polyploids where you want to avoid spurious syntenic matching between homeologous chromosomes. Also can use this if just doing a run among synteny-built genomes, where you know there are no inter-chromosomal SVs. Can dramatically improve speed among very large genomes.
  • maxOgPlaces specifies the max number of unique placements that an orthogroup can have before being excluded from synteny
  • arrayJump specifies the maximum distance in gene rank order between two genes in the same tandem array

B.3 Treatment of ‘secondary’ hits

In some cases it may be impossible to include an outgroup to study paralogs. In this case, GENESPACE offers a set of parameters that facilitate searches after masking the primary synteny map: nSecondHits/nSecondaryHits is a scaling parameter on ploidy, specifying how many additional hits should be included. nGapsSecond, blkSizeSecond, blkRadiusSecond, synBuffSecond, onlyOgAnchorsSecond follow above, but for secondary hits. maskBuffer numeric (default = 500), the minimum distance that a secondary (or homeolog w/in polyploid genome) block can be created relative to an existing block.

B.4 General system parameters

  • nCores the number of parallel processes to run … defaults to half of those available on the machine. Most processes are parallelized in GENESPACE.
  • path2orthofinder character string coercible to a file path that points to the orthofinder executable. If orthofinder is in the path, specify with “orthofinder”
  • path2diamond character string coercible to a file path that points to the diamond executable. If diamond is in the path, specify with “diamond”. Since diamond is included in the OrthoFinder install, if OrthoFinder is in the $PATH, then so should diamond.
  • path2mcscanx The path to the MCScanx installation. This must contain the ‘MCScanX_h’ folder.
  • onewayBlast logical of length 1, specifying whether one-way blasts should be run via orthofinder -1 .... This replaces orthofinderMethod = “fast”, but uses diamond2 --more-sensitive whereas the previous method used –fast specification. Substantial speed improvements in large runs with little loss of fidelity. However, use carefully since full testing has not been completed.
  • diamondUltraSens specifies whether the diamond mode run within orthofinder should be –more-sensitive (default, FALSE) –ultra-sensitive. Not likely to be necessary, except among very diverged genomes … these tend to not have synteny anyways.
  • rawOrthofinderDir file.path to the location of an existing raw orthofinder run. Defaults to the $wd/orthofinder, but can be any path point to a valid orthofinder run.
  • wd file.path where the analysis will be run
  • dotplots character string either “always”, “never”, or “check”. Default (check) only writes a dotplot if there are < 10k unique chromosome combinations (facets). “always” means that dotplots are made regardless of facet numbers, which can be very slow in some instances. “never” is by far the fastest method, but also never produces dotplots.

APPENDIX C: run_genespace pipeline

C.1: Running orthofinder (or parsing existing results)

In this case, OrthoFinder has been run previously, so the output is just parsed (this improves rendering speed by 10x for this tutorial).

C.2: Combining and annotating bed files

Here, HOGs from orthofinder are merged with the bed files. With this data in hand, GENESPACE build ‘tandem arrays’ which are defined as sets of genes on the same chromosome and in the same orthogroup separated by no more than arrayJump interleaved genes. It also flags two sets of problematic genes:

  1. Those on chromosomes with < 2 * minBlkSize unique orthogroups. These are unlikely to have any synteny mapping.
  2. Those that are in “over-dispersed” orthogroups. By definition, an orthogroup that hits > (max(ploidy)) * 8 unique positions, separated by no less than arrayJump orthogroups along a chromosome. Low-quality annotations can often have lots of these types of orthgroups that hit repetitive or low-complexity genomic regions everywhere.

Genes that fall into either of these two categories are flagged as ‘noAnchor’ and are never used in the synteny searches (but can still be syntenic if proximate to a syntenic anchor).

C.3: Combining and annotating the blast files

The gene positions stored in the bed files and orthogroup information are added to the blast files and stored in /syntenicHits. The total number of hits and those where the query and target genes are in the same orthogroup are reported. In addition to building the syntenicHits/$rawblast.txt.gz text files, this also populates the dotplots/$rawHits.pdf

C.4: Flagging synteny

See full synteny methods below. For each pair of genomes, the larger genome is the query and the smaller the target. The total syntenic hits and ‘anchors’ are reported. Syntenic anchors are any set of perfectly collinear blast hits as defined first by MCScanX, then cleaned up by dbscan. Syntenic hits (which are the first number reported) are those proximate to the anchors. See above for definitions of blocks and regions. SVs are defined as the number of block breakpoints - the number of unique chromosome combinations that contain syntenic hits.

C.5: Building synteny-constrained orthogroups

Output here is verbose only if orthofinderInBlk = TRUE. Otherwise, this just reads in all the syntenic orthogroup blast hits and clusters based on the resulting graph.

C.6: Integrating syntenic positions

For each pair of genomes, we conduct linear interpolation of syntenic positions for each gene within the block coordinates. The output reflects the number of overlapping blocks for each gene in each pairwise combination. This should generally be the ploidy of the genome, although can be lower if there are lots of small syntenic blocks (which may have genes in the gaps without a syntenic placement).

C.7: Calculate block coordinates and make riparian plots

By default, we do not construct riparian plots with a polyploid reference - this creates multiple mappings for each color (chromosome) and can be a very confusing and large output file. Because of this, we calculate syntenic blocks phased only against the haploid reference genomes. If there are no haploid references, riparian plots are not constructed.

C.8: Constructing syntenic pan-gene sets

The syntenic hits and orthogroups are decomposed into syntenic position interpolated pan-gene sets. These consiste of unique interpolated positions against a reference (“n pos”), syntenic orthogroup genes (“synOgs”), members of arrays with genes in synOgs (“array mem”) and non-syntenic orthologs to the single representative gene for each pan-gene set (“NS orthos”).

APPENDIX D: Methods details

D.1 Running orthofinder

By default, GENESPACE runs the full OrthoFinder program internally from R. If an orthofinder run is already available (and uses exactly the same genomeIDs as the GENESPACE run), GENESPACE will detect this and not run OrthoFinder.

For very large runs, we highly recommend running OrthoFinder outside of R. You can do this by using init_genespace(..., path2orthofinder = NA). When you use run_genespace with the resulting gs parameters, GENESPACE will prepare the input files and print the OrthoFinder command to the console before killing the run. You can then run that command in your shell or on a server separately.

Once orthofinder is run, you can point to the output directory via init_genespace(..., rawOrthofinderDir = "/path/to/run"), or just put it in the /orthofinder directory in your GENESPACE wd. GENESPACE will find the necessary results, QC them, and if all looks good, move them to the /results directory and begin the pipeline.

D.2 Aggregating orthofinder information

The bed files are concatenated and a genomeID column is added so that all gene position is in one file: /results/combBed.txt. OrthoFinder ofIDs (unique gene identifiers), OGs (orthogroup.tsv) and HOGs (phylogenetically hierarchical orthogroups, N0.tsv) are parsed and added as new columns to the combBed file. This provides all the raw information needed for all further steps.

Problematic genes and chromosomes are flagged by two rules: chromosomes must have > blkSize (default = 5) unique orthogroups. Orthogroups must hit < ploidy * maxOgPlaces (default maxOgPlaces = 8) unique positions in the genome (separated by > synBuffer, default = 150). Genes in either the problematic orthogroups or chromosomes are flagged as noAnchor, meaning they will still be considered for the pangenome steps, but cannot be used as syntenic anchors or to infer sytenic positions across genomes.

With HOGs (or OGs if useHOGs = FALSE) in hand, tandem arrays are calculated as follows: 1) any OG with multiple members on a single genome-by-chromosome combination are ‘potential tandem arrays’; 2) potential tandem arrays with a gap between members > synBuffer are split into their own clusters; 3) clusters with > 1 member are tandem arrays; 4) genes in the most physically central position in the array are ‘arrayReps’. Gene order is re-calculated for just the array reps and steps 1-4 are run iteratively until convergence. Genes that are not array representatives are also falgged as noAnchor.

The information from the combBed file are added to the blast files to produce annotated blast files, aka synHits. Of particular importance is whether both the query and target genes are in the same orthogroup “sameOg” and are not flagged as noAnchor. Reciprocal query/target blast hits are aggregated into a single file where the larger genome is the query and the smaller the target.

D.3 Calculating pairwise synteny

The information stored in the combined/annotated bed file is merged with the blast results to make synHits matrices. The function synteny annotates each unique pairwise synHits file with three columns: “isAnchor”, “inBuffer”, “isSyntenic” and “blkID”. If blkID is NA and inBuffer == FALSE, that blast hit is not syntenic. Otherwise, the hit is syntenic and may be used for direct syntenic position interpolation if isAnchor == TRUE.

Synteny is inferred by the following pipeline:

  1. Hits are culled to the topN hits which is the ploidy of the target genome.
  2. If onlyOgAnchors == TRUE, these topN hits are further culled to only hits where the query and target genomes are in the same phylogenetically hierarchical orthogroup.
  3. Gene rank order positions (order of genes along the chromosomes) are re-calculated for these culled query and target genes separately and MCScanX is run on the culled and re-ordered hits.
  4. Euclidean distances between rank order positions of MCScanX-culled hits and hits (including not topN hits, but potentially excluding non-OG hits if onlyOgAnchors == TRUE) are calculated. Those hits that are within synBuffer radius of MCScanX-culled hits are retained.
  5. MCScanX is re-run on the hits within the buffer radius
  6. Initial syntenic blocks are calculated by 2d clustering via dbscan
  7. Overlapping non-duplicated blocks (e.g. interleaved) are split using run-length equivalent decoding
  8. Hits within block intervals are extracted and those within synBuffer radius of anchors are flagged as “inBuffer”.

APPENDIX E: GENESPACE output specifications

E.1 Blast and synHits files

Starting in v1.0.9, GENESPACE produces two ‘synHits’ files. The first, with the format $QUERY_vs_$TARGET.allBlast.txt.gz contains all blast hits with a set of columns that contain the information needed for synteny (see below). The function finalize_synteny (see below) take only hits with TRUE in the flag isSyntenic and puts these in a simplified text file with the format: $QUERY_vs_$TARGET.synHits.txt.gz, which only have the bitScore column from the blast8 specification and also drops the “sameInblkOg”, “isSyntenic”, and “inBuffer” columns. The splitting out syntenic hits dramatically improves read/write speed for all downstream functions that only see syntenic hits.

allBlast files have the following columns:

columns 1-8: annotation positions of the query genes columns 9-16: annotation positions of the target genes columns 17-26: standard blast8-formatted fields col 27 (sameOg): logical, query and target in same orthogroup? col 28 (noAnchor): logical, should this hit never be an anchor? col 29 (isAnchor): logical, is this hit a syntenic anchor? col 30 (inBuffer): logical, is this hit in the synteny buffer? col 31 (isSyntenic): logical, is this hit in the blkRadius? col 32 (regID): character, large region ID of syntenic hits col 33 (blkID): character, blockID of syntenic hits

synHits files have the following columns:

columns 1-8: annotation positions of the query genes columns 9-16: annotation positions of the target genes column 17: bit score from the blast file col 18 (sameOg): logical, query and target in same orthogroup? col 19 (noAnchor): logical, should this hit never be an anchor? col 20 (isAnchor): logical, is this hit a syntenic anchor? col 21 (regID): character, large region ID of syntenic hits col 22 (blkID): character, blockID of syntenic hits

E.2 Pangenes

The function syntenic_pangenes converts orthogroup and position data into a single matrix. This is stored in the /pangenes directory. This is the raw data that goes into the pangenes output (position x genome matrix). The pangenes.txt file contains 11 columns:

  1. ‘ofID’: orthofinder-given unique gene ID
  2. ‘pgID’: the unique position-by-orthogroup combination identifier. Each row in the pangenome-annotation is a unique value of this field.
  3. ‘interpChr’: the chromosome (syntenic or actual) on the reference genome
  4. ‘interpOrd’: the gene-rank order position (interpolated or actual) on the reference genome
  5. ‘pgRepID’: the ofID for the representative gene for each pgID
  6. ‘genome’: genome of the focal gene (id column)
  7. ‘og’: orthogroup ID taken from the combBed.txt file
  8. ‘flag’: specifies whether the gene is a syntenic ortholog (‘PASS’), non-syntenic ortholog, or array member.
  9. ‘id’: bed-specified gene ID 10-13: chromosome, physical positions, and gene rank order for each id.

The function query_pangenes transforms these data into a more easily read ‘wide’ format: columns 1:5 collow columns 2-5 above. The remaining columns aggregate genes within each genome (column). Genes flagged with ’*’ are non-syntenic orthologs; genes flagged with ‘+’ are array members, but not the most central (representative) gene in the array.

E.3 The combBed.txt file

Here are the first two lines of the combBed.txt file for the human-chicken.

   chr     start       end           id    ofID pepLen  ord  genome  arrayID isArrayRep    globOG                 globHOG noAnchor    og
1:   1 154464218 154465306    LOC418813 0_10171     98 2913 chicken Arr00001       TRUE OG0000006 N0.HOG0000007 OG0000006    FALSE 17571
2:   1 171849899 171861579 LOC112532821  0_8262   1130 3029 chicken Arr00002       TRUE OG0000008 N0.HOG0000009 OG0000008    FALSE 20208

Column definitions:

  1. ‘chr’: chromosome identifier copied from the bed file
  2. ‘start’: gene start position copied from the bed file
  3. ‘end’: gene end position copied from the bed file
  4. ‘id’: gene identifier copied from the bed file
  5. ‘ofID’: unique orthofinder ID taken from SequenceIDs.txt
  6. ‘pepLen’: number of amino acids in the predicted peptide
  7. ‘ord’: gene rank order position along the entire genome
  8. ‘genome’: unique genome ID, taken from the bed file name
  9. ‘arrayID’: tandem array identifier, if the field begins with “NoArr”, then this gene is not part of a tandem array
  10. ‘isArrayRep’: TRUE/FALSE specifying whether a gene is the representative (closest physically to the median position of the array)
  11. ‘globOG’: global orthogroup ID, taken from orthogroups.tsv
  12. ‘globHOG’: global phylogenetically hierarchical orthogroup, taken from N0.tsv
  13. ‘noAnchor’: TRUE/FALSE whether the gene can be considered for synteny
  14. ‘og’: changes depending on the stage of the run. Initially it is the globHOG or globOG. Then after synteny, it is populated with synOG. Lastly, may be populated with inblkOG.

E.4 The syntenic block coordinates

This file, stored in /results/syntenicBlock_coordinates.csv and /results/syntenicRegion_coordinates.csv, contain pairwise block coordinates for each pair of genomes (genome1, genome2; 1 or 2 appended to the column name indicates that this data is associated with either genome1 or genome2 respectively). For each genome, there are 15 columns:

  1. ‘chr’: chromosome ID
  2. ‘minBp’: lowest basepair (bp) position
  3. ‘maxBp’: greatest basepair (bp) position
  4. ‘minOrd’: lowest gene rank order (ord) position
  5. ‘maxOrd’: highest gene rank order (ord) position
  6. ‘minGene’: lowest position gene ID in the block
  7. ‘maxGene’: higest position gene ID in the block
  8. ‘nHits’: number of hits
  9. ‘startBp’: begining bp position (taking into account orientation)
  10. ‘endBp’: end bp position (taking into account orientation)
  11. ‘startOrd’: begining ord position (taking into account orientation)
  12. ‘endOrd’: end ord position (taking into account orientation)
  13. ‘firstGene’: first gene in the block (taking into account orientation)
  14. ‘lastGene’: last gene in the block (taking into account orientation)

For riparian plots, blocks need to be ‘phased’ by the reference genome. These coordinates are stored in /riparian/$GENOME_phasedBlks.csv.